Amendment Pursuant to 37 C.F.R. § 1.114
Docket No. 013.0207.US.UTL 



Amendment to the Claims 

This listing of claims will replace all prior versions, and listings, of claims in the 
application: 

1 Claims 1-18 (canceled). 

1 19. (currently amended) A system [[according to Claim 18, further]] for

2 providing efficient document scoring of concepts within and clustering of 

3 documents in an electronically-stored document set, comprising: 

4 [[the]] a database electronically storing a document set;

5 a scoring module [[evaluating the score]] scoring a document in the

6 electronically-stored document set, comprising: 

7 a frequency submodule determining a frequency of occurrence of 

8 at least one concept within a document;

9 a concept weight submodule analyzing a concept weight reflecting 

10 a specificity of meaning for the at least one concept within the document, wherein 

11 the concept weight is based on a number of terms for the at least one concept;

12 a structural weight submodule analyzing a structural weight 

13 reflecting a degree of significance based on structural location within the 

14 document for the at least one concept; 

15 a corpus weight submodule analyzing a corpus weight inversely 

16 weighing a reference count of occurrences for the at least one concept within the 

17 document; 

18 a scoring evaluation submodule evaluating a score to be associated 

19 with the at least one concept as a function of a summation of the frequency, 

20 concept weight, structural weight, and corpus weight in accordance with the 

21 formula: 

22 $S_i = \sum_{j=1}^{n} f_{ij} \times cw_{ij} \times sw_{ij} \times rw_{ij}$

23 where $S_i$ comprises the score, $f_{ij}$ comprises the frequency, $0 < cw_{ij} \le 1$ comprises

24 the concept weight, $0 < sw_{ij} \le 1$ comprises the structural weight, and $0 < rw_{ij} \le 1$

25 comprises the corpus weight for occurrence j of concept i;

26 a vector submodule forming the score assigned to the at least one

27 concept as a normalized score vector for each such document in the 

28 electronically-stored document set; and 

29 a determination submodule determining a similarity between the 

30 normalized score vector for each such document as an inner product of each 

31 normalized score vector; 

32 a clustering module grouping the documents by the score into a plurality 

33 of clusters, comprising: 

34 a selection submodule selecting a set of candidate seed documents 

35 from the electronically-stored document set; 

36 a cluster seed submodule identifying seed documents by applying 

37 the similarity to each such candidate seed document and selecting those candidate 

38 seed documents that are sufficiently unique from other candidate seed documents 

39 as the seed documents; 

40 an identification submodule identifying a plurality of non-seed 

41 documents; 

42 a comparison submodule determining the similarity between each 

43 non-seed document and a cluster center of each cluster; and 

44 a clustering submodule assigning each such non-seed document to 

45 the cluster with a best fit, subject to a minimum fit; 

46 a threshold module relocating outlier documents, comprising determining 

47 the similarity between each of the documents grouped into each cluster based on 

48 the center of the cluster and the scores assigned to each of the at least one 

49 concepts in that document, dynamically determining a threshold for each cluster 

50 as a function of the similarity between each of the documents, and identifying and 

51 reassigning each of the documents with the similarity falling outside the 

52 threshold; and 

a processor to execute the modules and submodules.
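
By way of illustration only, and without limiting the claim, the score formula recited above can be sketched in Python. The function name `concept_score` and the tuple layout are assumptions introduced for this sketch. Weights up to and including 1 are accepted, which is also an assumption, chosen to agree with the weight formulas recited in the dependent claims.

```python
from typing import Iterable, Tuple

# One occurrence j of a concept i: (f_ij, cw_ij, sw_ij, rw_ij).
# This layout is hypothetical and chosen only for the example.
Occurrence = Tuple[float, float, float, float]


def concept_score(occurrences: Iterable[Occurrence]) -> float:
    """Sum f * cw * sw * rw over all occurrences of a single concept."""
    total = 0.0
    for f, cw, sw, rw in occurrences:
        # Each weight is expected to be positive and no greater than 1.
        for weight in (cw, sw, rw):
            if not 0.0 < weight <= 1.0:
                raise ValueError("weights must lie in the interval (0, 1]")
        total += f * cw * sw * rw
    return total


# Two occurrences of the same concept within one document.
score = concept_score([(2, 0.5, 1.0, 1.0), (1, 0.75, 0.5, 1.0)])
```

Here the first occurrence contributes 2 x 0.5 x 1.0 x 1.0 and the second contributes 1 x 0.75 x 0.5 x 1.0.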

20. (previously presented) A system according to Claim 19, further
comprising: 

the concept weight module evaluating the concept weight in accordance 
with the formula: 

$cw_{ij} = \begin{cases} 0.25 + (0.25 \times t_{ij}), & 1 \le t_{ij} \le 3 \\ 0.25 + (0.25 \times [7 - t_{ij}]), & 4 \le t_{ij} \le 6 \\ 0.25, & t_{ij} \ge 7 \end{cases}$



where $cw_{ij}$ comprises the concept weight and $t_{ij}$ comprises the number of terms for
occurrence j of each such concept i. 

21. (previously presented) A system according to Claim 19, further
comprising: 

the structural weight module evaluating the structural weight in 
accordance with the formula: 
where $sw_{ij}$ comprises the structural weight for occurrence j of each such concept i.

22. (previously presented) A system according to Claim 19, further 
comprising: 

the corpus weight module evaluating the corpus weight in accordance with 
the formula: 

$rw_{ij} = \begin{cases} \left(\dfrac{T - r_{ij}}{T}\right)^{2}, & r_{ij} > M \\ 1.0, & r_{ij} \le M \end{cases}$

6 where $rw_{ij}$ comprises the corpus weight, $r_{ij}$ comprises a reference count for

7 occurrence j of each such concept i, T comprises a total number of reference

8 counts of documents in the document set, and M comprises a maximum reference 

9 count of documents in the document set. 

1 23. (previously presented) A system according to Claim 19, further 

2 comprising: 

3 a compression module compressing the score in accordance with the 

4 formula: 

5 $S'_i = \log(S_i + 1)$

6 where $S'_i$ comprises the compressed score for each such concept i.
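
By way of illustration only, the compression step can be sketched as below. The claim does not state the base of the logarithm; the natural logarithm is assumed here, and the function name `compress_score` is hypothetical.

```python
import math


def compress_score(score: float) -> float:
    """Return log(score + 1), assuming a natural logarithm."""
    if score < 0.0:
        raise ValueError("score must be non-negative")
    # Adding 1 keeps a zero score at zero after compression.
    return math.log(score + 1.0)


compressed = compress_score(1.375)
```

Adding 1 before taking the logarithm maps a score of zero to zero and dampens large scores, so that a few very frequent concepts do not dominate a score vector.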

1 24. (currently amended) A system according to [[Claim 18]] Claim 19,

2 further comprising: 

3 a global stop concept vector cache maintaining concepts and terms; and 

4 a filtering module filtering selection of the at least one concept based on 

5 the concepts and terms maintained in the global stop concept vector cache. 

1 25. (currently amended) A system according to [[Claim 18]] Claim 19,

2 further comprising: 

3 a parsing module identifying terms within at least one document in the 

4 document set, and combining the identified terms into one or more of the 

5 concepts. 

1 26. (original) A system according to Claim 25, further comprising: 

2 the parsing module structuring each such identified term in the one or 

3 more concepts into canonical concepts comprising at least one of word root, 

4 character case, and word ordering. 

1 27. (original) A system according to Claim 25, wherein at least one of 

2 nouns, proper nouns and adjectives are included as terms. 

1 Claims 28-30 (canceled).

1 31. (currently amended) A system according to [[Claim 18]] Claim 19,

2 further comprising: 

3 the similarity submodule calculating the similarity in accordance with the 

4 formula: 

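5 $\cos\sigma_{AB} = \dfrac{\langle \vec{S}_A \cdot \vec{S}_B \rangle}{|\vec{S}_A| \, |\vec{S}_B|}$

6 where $\cos\sigma_{AB}$ comprises a similarity between a document A and a document B,

7 $\vec{S}_A$ comprises a score vector for document A, and $\vec{S}_B$ comprises a score vector for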

8 document B. 
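
By way of illustration only, the similarity between two score vectors can be sketched as a cosine similarity, that is, the inner product of the two vectors divided by the product of their magnitudes. The function name and the treatment of an all-zero vector are assumptions made for this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("score vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a document with no scored concepts matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

When both vectors are already normalized to unit length, the denominator equals 1 and the similarity reduces to the inner product alone.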

1 Claims 32-35 (canceled). 

1 36. (currently amended) A computer-implemented method [[according

2 to Claim 35, further]] for providing efficient document scoring of concepts within

3 and clustering of documents in an electronically-stored document set, comprising:

4 [[evaluating the score]] scoring a document in an electronically-stored

5 document set, comprising: 

6 determining a frequency of occurrence of at least one concept 

7 within a document; 

8 analyzing a concept weight reflecting a specificity of meaning for 

9 the at least one concept within the document, wherein the concept weight is based 

10 on a number of terms for the at least one concept; 

11 analyzing a structural weight reflecting a degree of significance 

12 based on structural location within the document for the at least one concept; 

13 analyzing a corpus weight inversely weighing a reference count of 

14 occurrences for the at least one concept within the document; and 

15 evaluating a score to be associated with the at least one concept as

16 a function of a summation of the frequency, concept weight, structural weight, 

17 and corpus weight in accordance with the formula:

18 $S_i = \sum_{j=1}^{n} f_{ij} \times cw_{ij} \times sw_{ij} \times rw_{ij}$

19 where $S_i$ comprises the score, $f_{ij}$ comprises the frequency, $0 < cw_{ij} \le 1$ comprises

20 the concept weight, $0 < sw_{ij} \le 1$ comprises the structural weight, and $0 < rw_{ij} \le 1$

21 comprises the corpus weight for occurrence j of concept i;

22 forming the score assigned to the at least one concept as a normalized 

23 score vector for each such document in the electronically-stored document set; 

24 determining a similarity between the normalized score vector for each 

25 such document as an inner product of each normalized score vector; 

26 grouping the documents by the score into a plurality of clusters, 

27 comprising: 

28 selecting a set of candidate seed documents from the 

29 electronically-stored document set; 

30 identifying seed documents by applying the similarity to each such 

31 candidate seed document and selecting those candidate seed documents that are 

32 sufficiently unique from other candidate seed documents as the seed documents; 

33 identifying a plurality of non-seed documents; 

34 determining the similarity between each non-seed document and a 

35 center of each cluster; and 

36 assigning each non-seed document to the cluster with a best fit, 

37 subject to a minimum fit; and 

38 relocating outlier documents, comprising: 

39 determining the similarity between each of the documents grouped 

40 into each cluster based on the center of the cluster and the scores assigned to each 

41 of the at least one concepts in that document; 

42 dynamically determining a threshold for each cluster as a function 

43 of the similarity between each of the documents; and 

44 identifying and reassigning each of the documents with the

45 similarity falling outside the threshold.
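
By way of illustration only, the grouping steps recited above (selecting seed documents that are sufficiently unique, then assigning each non-seed document to the best-fitting cluster subject to a minimum fit) can be sketched as below. The function names, the threshold parameters `max_seed_similarity` and `min_fit`, and the use of each seed's own vector as its cluster center are all assumptions made for this sketch and are not recited in the claim.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def similarity(a: Vector, b: Vector) -> float:
    """Inner product of the two vectors after normalizing each to unit length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def pick_seeds(candidates: List[Vector], max_seed_similarity: float) -> List[int]:
    """Keep candidates that are not too similar to any seed already kept."""
    seeds: List[int] = []
    for idx, cand in enumerate(candidates):
        if all(similarity(cand, candidates[s]) < max_seed_similarity for s in seeds):
            seeds.append(idx)
    return seeds


def assign(doc: Vector, centers: List[Vector], min_fit: float) -> Optional[int]:
    """Return the index of the best-fitting center, or None if below min_fit."""
    best, best_sim = None, min_fit
    for c_idx, center in enumerate(centers):
        sim = similarity(doc, center)
        if sim >= best_sim:
            best, best_sim = c_idx, sim
    return best


candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
seed_ids = pick_seeds(candidates, max_seed_similarity=0.8)
centers = [candidates[i] for i in seed_ids]
near_first = assign([0.9, 0.1], centers, min_fit=0.5)
unassigned = assign([1.0, 1.0], centers, min_fit=0.8)
```

In this example the second candidate is nearly parallel to the first and is therefore not kept as a seed, while a document that fits neither center well enough is left unassigned.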

1 37. (currently amended) A computer-implemented method according 

2 to Claim 36, further comprising: 

3 evaluating the concept weight in accordance with the formula: 

4 $cw_{ij} = \begin{cases} 0.25 + (0.25 \times t_{ij}), & 1 \le t_{ij} \le 3 \\ 0.25 + (0.25 \times [7 - t_{ij}]), & 4 \le t_{ij} \le 6 \\ 0.25, & t_{ij} \ge 7 \end{cases}$

5 where $cw_{ij}$ comprises the concept weight and $t_{ij}$ comprises the number of terms for

6 occurrence j of each such concept i.
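
By way of illustration only, the piecewise concept weight can be sketched as below, assuming the three ranges of one to three terms, four to six terms, and seven or more terms. The function name `concept_weight` is hypothetical.

```python
def concept_weight(num_terms: int) -> float:
    """Piecewise weight that peaks for concepts of three or four terms."""
    if num_terms < 1:
        raise ValueError("a concept has at least one term")
    if num_terms <= 3:
        # Rises from 0.5 (one term) to 1.0 (three terms).
        return 0.25 + 0.25 * num_terms
    if num_terms <= 6:
        # Falls from 1.0 (four terms) to 0.5 (six terms).
        return 0.25 + 0.25 * (7 - num_terms)
    # Very long concepts receive the floor weight.
    return 0.25


weights = [concept_weight(t) for t in range(1, 9)]
```

The weight rises with the number of terms up to three, falls again between four and six, and stays at the floor of 0.25 for seven or more terms.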

1 38. (currently amended) A computer-implemented method according 

2 to Claim 36, further comprising: 

3 evaluating the structural weight in accordance with the formula: 

5 where $sw_{ij}$ comprises the structural weight for occurrence j of each such concept i.

1 39. (currently amended) A computer-implemented method according 

2 to Claim 36, further comprising: 

3 evaluating the corpus weight in accordance with the formula: 

4 $rw_{ij} = \begin{cases} \left(\dfrac{T - r_{ij}}{T}\right)^{2}, & r_{ij} > M \\ 1.0, & r_{ij} \le M \end{cases}$

5 where $rw_{ij}$ comprises the corpus weight, $r_{ij}$ comprises a reference count for

6 occurrence j of each such concept i, T comprises a total number of reference

7 counts of documents in the document set, and M comprises a maximum reference

8 count of documents in the document set. 
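
By way of illustration only, the corpus weight can be sketched as below. The sketch assumes that the weight is 1.0 while the reference count does not exceed M and is the square of (T - r) / T once it does; that reading of the first branch is an assumption, as are the function and parameter names.

```python
def corpus_weight(ref_count: float, total: float, max_ref: float) -> float:
    """Discount concepts whose reference count exceeds the maximum M."""
    if total <= 0:
        raise ValueError("total reference count T must be positive")
    if ref_count > max_ref:
        # Assumed form: the more widely referenced, the smaller the weight.
        return ((total - ref_count) / total) ** 2
    # At or below the maximum reference count, no discount is applied.
    return 1.0


common = corpus_weight(ref_count=50, total=100, max_ref=10)
rare = corpus_weight(ref_count=5, total=100, max_ref=10)
```

Under this assumption a concept referenced by half of the document set keeps a quarter of its score contribution, while a rarely referenced concept is not discounted.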

1 40. (currently amended) A computer-implemented method according 

2 to Claim 36, further comprising: 

3 compressing the score in accordance with the formula: 

4 $S'_i = \log(S_i + 1)$

5 where $S'_i$ comprises the compressed score for each such concept i.

1 41. (currently amended) A computer-implemented method according

2 to [[Claim 35]] Claim 36, further comprising:

3 maintaining concepts and terms in a global stop concept vector cache; and 

4 filtering selection of the at least one concept based on the concepts and 

5 terms maintained in the global stop concept vector cache. 

1 42. (currently amended) A computer-implemented method according 

2 to [[Claim 35]] Claim 36, further comprising:

3 identifying terms within at least one document in the document set; and 

4 combining the identified terms into one or more of the concepts. 

1 43. (currently amended) A computer-implemented method according 

2 to Claim 42, further comprising: 

3 structuring each such identified term in the one or more concepts into 

4 canonical concepts comprising at least one of word root, character case, and word 

5 ordering. 

1 44. (currently amended) A computer-implemented method according 

2 to Claim 42, further comprising: 

3 including as terms at least one of nouns, proper nouns and adjectives. 
1 Claims 45-47 (canceled). 






1 48. (currently amended) A computer-implemented method according 

2 to [[Claim 35]] Claim 36, further comprising:

3 calculating the similarity in accordance with the formula: 

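4 $\cos\sigma_{AB} = \dfrac{\langle \vec{S}_A \cdot \vec{S}_B \rangle}{|\vec{S}_A| \, |\vec{S}_B|}$

5 where $\cos\sigma_{AB}$ comprises a similarity between a document A and a document B,

6 $\vec{S}_A$ comprises a score vector for document A, and $\vec{S}_B$ comprises a score vector for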

7 document B. 

1 Claims 49-51 (canceled). 

1 52. (currently amended) A computer-readable storage medium holding 

2 code for providing efficient document scoring of concepts within and clustering 

3 of documents in an electronically-stored document set, comprising: 

4 code for scoring a document in an electronically-stored document set, 

5 comprising: 

6 code for determining a frequency of occurrence of at least one 

7 concept within a document; 

8 code for analyzing a concept weight reflecting a specificity of 

9 meaning for the at least one concept within the document, wherein the concept 

10 weight is based on a number of terms for the at least one concept;

11 code for analyzing a structural weight reflecting a degree of

12 significance based on structural location within the document for the at least one

13 concept;

14 code for analyzing a corpus weight inversely weighing a reference

15 count of occurrences for the at least one concept within the document; and

16 code for evaluating a score to be associated with the at least one

17 concept as a function of a summation of the frequency, concept weight, structural

18 weight, and corpus weight in accordance with the formula:

19 $S_i = \sum_{j=1}^{n} f_{ij} \times cw_{ij} \times sw_{ij} \times rw_{ij}$

20 where $S_i$ comprises the score, $f_{ij}$ comprises the frequency, $0 < cw_{ij} \le 1$ comprises

21 the concept weight, $0 < sw_{ij} \le 1$ comprises the structural weight, and $0 < rw_{ij} \le 1$

22 comprises the corpus weight for occurrence j of concept i;

23 code for forming the score assigned to the at least one concept as a 

24 normalized score vector for each such document in the electronically-stored 

25 document set; 

26 code for determining a similarity between the normalized score vector for 

27 each such document as an inner product of each normalized score vector; 

28 code for grouping the documents by the score into a plurality of clusters, 

29 comprising: 

30 code for selecting a set of candidate seed documents from the 

31 electronically-stored document set;

32 code for identifying seed documents by applying the similarity to 

33 each such candidate seed document and selecting those candidate seed documents 

34 that are sufficiently unique from other candidate seed documents as the seed 

35 documents; 

36 code for identifying a plurality of non-seed documents; 

37 code for determining the similarity between each non-seed 

38 document and a center of each cluster; and 

39 code for assigning each non-seed document to the cluster with a 

40 best fit, subject to a minimum fit; and 

41 code for relocating outlier documents, comprising: 

42 code for determining the similarity between each of the documents 

43 grouped into each cluster based on the center of the cluster and the scores 

44 assigned to each of the at least one concepts in that document; 

45 code for dynamically determining a threshold for each cluster as a 

46 function of the similarity between each of the documents; and 

47 code for identifying and reassigning each of the documents with

48 the similarity falling outside the threshold. 
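
By way of illustration only, the outlier step can be sketched as below. The claim requires only that the threshold for each cluster be determined dynamically as a function of the similarities; the particular rule used here (the mean similarity to the cluster center minus a multiple of the standard deviation) is an assumption, as are the function name and the parameter `num_std`.

```python
import statistics
from typing import List


def find_outliers(similarities: List[float], num_std: float = 1.0) -> List[int]:
    """Return indices of documents whose similarity to the center is too low."""
    if len(similarities) < 2:
        # A threshold derived from a single document is not meaningful.
        return []
    mean = statistics.mean(similarities)
    spread = statistics.pstdev(similarities)
    # Hypothetical dynamic threshold, recomputed for every cluster.
    threshold = mean - num_std * spread
    return [i for i, sim in enumerate(similarities) if sim < threshold]


# Similarity of each document in one cluster to that cluster's center.
outliers = find_outliers([0.90, 0.92, 0.88, 0.91, 0.30])
```

Documents returned by such a function would then be candidates for reassignment to another cluster, for example by repeating the best-fit comparison against the remaining cluster centers.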

1 53. (currently amended) An apparatus for providing efficient

2 document scoring of concepts within and clustering of documents in an 

3 electronically-stored document set, comprising: 

4 means for scoring a document in an electronically-stored document set, 

5 comprising: 

6 means for determining a frequency of occurrence of at least one 

7 concept within a document; 

8 means for analyzing a concept weight reflecting a specificity of 

9 meaning for the at least one concept within the document, wherein the concept 

10 weight is based on a number of terms for the at least one concept;

11 means for analyzing a structural weight reflecting a degree of

12 significance based on structural location within the document for the at least one

13 concept;

14 means for analyzing a corpus weight inversely weighing a

15 reference count of occurrences for the at least one concept within the document;

16 and

17 means for evaluating a score to be associated with the at least one

18 concept as a function of a summation of the frequency, concept weight, structural

19 weight, and corpus weight in accordance with the formula:

20 $S_i = \sum_{j=1}^{n} f_{ij} \times cw_{ij} \times sw_{ij} \times rw_{ij}$

21 where $S_i$ comprises the score, $f_{ij}$ comprises the frequency, $0 < cw_{ij} \le 1$ comprises

22 the concept weight, $0 < sw_{ij} \le 1$ comprises the structural weight, and $0 < rw_{ij} \le 1$

23 comprises the corpus weight for occurrence j of concept i;

24 means for forming the score assigned to the at least one concept as a 

25 normalized score vector for each such document in the electronically-stored 

26 document set; 

27 means for determining a similarity between the normalized score vector

28 for each such document as an inner product of each normalized score vector; 

29 means for grouping the documents by the score into a plurality of clusters, 

30 comprising: 

31 means for selecting a set of candidate seed documents from the

32 electronically-stored document set; 

33 means for identifying seed documents by applying the similarity to 

34 each such candidate seed document and selecting those candidate seed documents 

35 that are sufficiently unique from other candidate seed documents as the seed 

36 documents; 

37 means for identifying a plurality of non-seed documents; 

38 means for determining the similarity between each non-seed 

39 document and a center of each cluster; and 

40 means for assigning each non-seed document to the cluster with a 

41 best fit, subject to a minimum fit; and 

42 means for relocating outlier documents, comprising: 

43 means for determining the similarity between each of the 

44 documents grouped into each cluster based on the center of the cluster and the 

45 scores assigned to each of the at least one concepts in that document; 

46 means for dynamically determining a threshold for each cluster as 

47 a function of the similarity between each of the documents; and 

48 means for identifying and reassigning each of the documents with 

49 the similarity falling outside the threshold. 






